Dual-branch hybrid network for lesion segmentation in gastric cancer images

The effective segmentation of the lesion region in gastric cancer images can assist physicians in diagnosis and reduce the probability of misdiagnosis. The U-Net has been proven to provide segmentation results comparable to those of specialists in medical image segmentation because of its ability to extract high-level semantic information. However, it has limitations in obtaining global contextual information. On the other hand, the Transformer excels at modeling explicit long-range relations but cannot capture low-level detail information. Hence, this paper proposes a Dual-Branch Hybrid Network based on the fusion of the Transformer and the U-Net to overcome both limitations. We propose the Deep Feature Aggregation Decoder (DFA), which aggregates only the deep features to obtain salient lesion features for both branches and reduce the complexity of the model. In addition, we design a Feature Fusion (FF) module that uses multi-modal fusion mechanisms to let the independent features of the different modalities interact and the linear Hadamard product to fuse the feature information extracted from both branches. Finally, the Transformer loss, the U-Net loss, and the fused loss are each computed against the ground truth label and combined for joint training. Experimental results show that our proposed method achieves an IOU of 81.3%, a Dice coefficient of 89.5%, and an Accuracy of 94.0%. These metrics demonstrate that our model outperforms existing models in obtaining high-quality segmentation results and has excellent potential for clinical analysis and diagnosis. The code and implementation details are available on GitHub at https://github.com/ZYY01/DBH-Net/.

Gastric cancer is one of the most common malignant tumors, with more than 1 million new patients yearly 1. The incidence of gastric cancer ranks among the top three cancers in China, with a mortality rate of 12.4% 2. In terms of morbidity and mortality, gastric cancer is considered a severe and lethal malignancy 3. Gastroscopy is the most common method of detecting and diagnosing gastric cancer. It relies heavily on the expertise and practical experience of trained doctors. Research has shown that the accuracy of manual gastroscopy is only 69-79% 4. With the introduction of deep learning algorithms into medical image segmentation, many studies have used Convolutional Neural Networks (CNNs) to segment gastric cancer images. Hirasawa et al. 5 achieved automatic detection of gastric cancer in endoscopic images using CNNs, but the accuracy was limited by ambiguous lesion features. Pan et al. 6 distinguished early gastric cancer images from non-cancerous images by improving the SSD model; they proposed the DSF module to fuse features at different levels effectively. Zhang et al. 7 proposed an enhanced SSD architecture called SSD-GPNet, which takes advantage of the cross-layer relationship in the feature pyramid to increase the receptive field of the network and enhance feature extraction. Although CNNs achieve a better recognition effect, the results still cannot meet the requirements of auxiliary medical diagnosis. This prompted us to seek more targeted network structures to improve segmentation performance.
Ronneberger et al. 8 proposed the U-Net in 2015, which uses skip connections so that the final restored feature map incorporates more low-level feature information, and which is widely applied in medical image segmentation. Many studies have adapted the U-Net to gastric cancer lesion segmentation. Qiu et al. 9 identified certain types of lesion sites in gastric cancer using an improved U-Net model based on a pyramidal structure. Zhang et al. 10 developed a modified U-Net that enhances the fusion of high-level and low-level feature information by designing SERES and DAGC modules to replace the pooling operation. Although the improved U-Net methods have proven more effective, their inherent limitations leave them unable to model explicit long-range relations well. Due to the number of folds in gastric mucosa, the complexity of the [...]

Related works
U-shaped networks. The semantic structure of medical images is relatively simple, so both their high-level semantic information and their low-level features are essential. The U-Net has achieved good performance in medical image segmentation by using skip connections to provide more detailed information, and many variants of the U-Net have achieved excellent performance. Oktay et al. 23 proposed the Attention U-Net model, which incorporates attention gates (AGs) to recalibrate the output features of the encoder and suppress irrelevant noise, highlighting the salient features passed through the skip connections. Li et al. 24 proposed an attention-based nested segmentation network, ANU-Net, which performs well on the liver tumor segmentation dataset LiTS by redesigning dense skip connections. Ni et al. 25 proposed the RAUNet to solve the problem of specular reflection in cataract segmentation by adding an enhanced attention module that fuses multi-level features and captures contextual information effectively. Alom et al. 26 proposed the R2U-Net, which combines the advantages of the U-Net, the residual network, and the RCNN, and performs better in retinal image segmentation tasks with the same number of parameters. Zhou et al. 27 proposed a segmentation architecture (UNet++) based on nested dense skip connections, which is effective on abdominal CT liver segmentation datasets and colonic polyp segmentation datasets. The above research confirms that the U-Net has become one of the most popular deep learning frameworks in medical image segmentation, with good segmentation performance.
www.nature.com/scientificreports/
Transformers applications. The Transformer is widely used in many NLP tasks with good performance. The ViT (Vision Transformer) was first proposed in 28 for image processing in 2020. It showed results comparable to the CNNs of that time but required significantly fewer computational resources to train. Since then, many studies have applied the Transformer to medical image segmentation. Valanarasu et al. 29 proposed the MedT model to address the poor performance of the Transformer on small medical datasets. It uses a gated axial-attention model, which extends existing architectures by introducing an additional control mechanism in the self-attention module. Ji et al. 30 proposed the Multi-Compound Transformer (MCTrans), which integrates rich feature learning and semantic structure mining into a unified framework. Gao et al. 31 proposed the UTNet, which applies self-attention modules in the encoder and decoder to capture long-range dependencies with minimal overhead. Zhang et al. 32 proposed the Multi-Branch Hybrid Transformer Network (MBT-Net), based on a body-edge branch, to obtain more details and contextual information. Cao et al. 33 proposed Swin-Unet, a U-shaped encoder-decoder structure based on Swin-Transformer blocks. It introduced patch expanding layers to achieve up-sampling and feature dimensionality increase without convolution or interpolation operations. Lin et al. 34 proposed DS-TransUNet to address the problem that intrinsic structural features at the pixel level are ignored during patch division; it also introduced the TIF module to achieve efficient interaction between multi-scale features using the MSA mechanism. The above studies confirm that Transformers are widely used in medical image segmentation and perform well.

Method
The overall framework of our proposed end-to-end Dual-Branch Hybrid Network is shown in Fig. 1. The U-Net branch extracts spatial information at each scale, and the Swin-Transformer branch captures global contextual information. To obtain the feature information of the salient lesion regions extracted from the two branches and reduce the complexity of the model, we propose the Deep Feature Aggregation Decoder (DFA) to aggregate the deep features to recover the spatial details of the lesion region and output the loss value between segmentation result and the ground truth label. In addition, the features extracted from both branches are fed into the Feature Fusion (FF) module for processing and passed into the Decoder module via a skip connection. The Decoder structure recovers the details of the image and the corresponding spatial dimensions to output the loss value and the final segmentation results. In addition, we combine the loss values obtained from the three components by weighting them for joint training to maximize the advantages of the two branches.
Swin-transformer branch. The design of the Swin-Transformer branch follows the typical encoder-decoder architecture. The encoder uses the Swin-Transformer architecture proposed in 20, and the decoder uses our proposed DFA module. The overall framework is shown in the yellow dashed box in Fig. [...]
The Transformer architecture uses the Multi-head Self Attention (MSA) module to compute global self-attention for feature learning, which is computationally intensive and leads to high model complexity. The Swin-Transformer block introduces the idea of local computation: it calculates self-attention within non-overlapping windows, significantly reducing computational complexity. The general structure is shown in Fig. 2. Specifically, the Swin-Transformer block comprises two sets of Layer-Norm (LN) layers, a window-based MSA layer, a residual connection, and a 2-layer Multilayer Perceptron (MLP) unit. The window-based module (W-MSA) calculates self-attention only within each window, whereas the shifted-window-based module (SW-MSA) solves the problem of transferring information between windows. Based on this window partitioning mechanism, the Swin-Transformer block can be formulated in Eqs. [...]
The decoder uses our proposed DFA module to replace the original decoder in U-Net. The feature maps extracted from the last three convolutions are passed into the DFA module to output the segmentation results. The general structure is shown in the blue dotted box in Fig. 1. Empirically, we set the value of hyperparameter C to 96.
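For reference, the Swin-Transformer block described in this section is conventionally written as the four residual updates below. This is the standard formulation from the original Swin Transformer paper, given here only as a sketch of what the block computes; the symbols $z^{l}$ and $\hat{z}^{l}$ (the outputs of the MLP and the (S)W-MSA sub-layers of block $l$) are notation assumed for illustration and may differ from the authors' own numbering and naming.

```latex
\begin{aligned}
\hat{z}^{l}   &= \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
```

Each line is a sub-layer applied to a layer-normalized input plus a residual connection. The first pair uses regular windows and the second pair uses shifted windows, which is how information passes between neighbouring windows.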
Deep feature aggregation decoder DFA. To output the segmentation results of the U-Net and the Swin-Transformer, we need a decoder structure that recovers the image information. Because we focus on the segmentation results of the Transformer and the U-Net for salient lesion regions, fast and accurate localization is our primary objective. Therefore, to accurately locate the gastric cancer lesion region and reduce the complexity of the model, we propose the Deep Feature Aggregation Decoder (DFA), which eliminates the contribution of low-level features to the computational complexity and recovers the spatial detail of the lesion region. The structure of the module is shown in Fig. 3. We aggregate the output features of the last three modules, $F_i$, $i = 1, 2, 3$. To obtain global information from the deeper features, we introduce the Receptive Field Block (RFB) 35 to increase the receptive field. Compared to the conventional RFB module, we add a convolutional layer with a dilation rate of 7 and reduce the number of channels to 48 to decrease the computational cost of feature extraction, as shown in module $RFB_{48}$ in Fig. 3. We construct two aggregation decoders, $AggregationDecoder_{1,2}$, to fuse feature information at different scales. The structure is shown in Fig. 3. The decoder uses multiplication and concatenation along the channel dimension for feature interaction, and finally, a convolution layer [...]
where $Up_2$ and $Up_4$ are linear interpolation operations with scale_factor of 2 and 4, respectively. $\sigma_f$ consists of two sets of convolutional layers with kernel size 3 × 3, a Batch Normalization (BN) layer, and an interpolation layer with scale_factor of 8.
Feature fusion module FF. We propose an FF module to effectively combine the encoded information extracted from the Swin-Transformer branch and the U-Net branch, as shown in Fig. 4. The module incorporates a multi-modal mechanism and a linear Hadamard product to achieve an interactive fusion of feature information. The multi-modal mechanism fuses the features extracted by the U-Net and Transformer branches under their respective modalities and feeds the intermediate information output by each modality to the next layer to emphasize the correlations between modalities, as shown in the green dashed box in Fig. 1. Specifically, we construct four FF modules for fusing feature maps of different sizes. With the exception of the first FF module, the remaining three FF modules introduce $ff_{i-1}$ to achieve feature fusion across the different modalities. The feature maps extracted from the Swin-Transformer branch, $st_i$, the U-Net branch, $u_i$, and the output of the previous FF module, $ff_{i-1}$ (with $i = 2, 3, 4$), are refined using convolution operations to obtain $F_{st_i}$, $F_{ff_{i-1}}$ and $F_{u_i}$. After that, the features at the same position $l$ are linearly fused (Hadamard product) to obtain the matrix $b_i$. The first FF module incorporates only $st_1$ and $u_1$. The formulation is shown in Eqs. (9) to (12). The resulting feature $ff_i$ effectively captures the global contextual and spatial structure information at the current resolution.
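The Hadamard fusion step of the FF module can be illustrated with a minimal pure-Python sketch. It covers only the element-wise product. The convolutional refinement that precedes it, and whatever follows it in Eqs. (9) to (12), is not reproduced. The function name `hadamard_fuse` and the toy 2 × 2 single-channel maps are hypothetical, and all inputs are assumed to have already been refined to the same shape.

```python
def hadamard_fuse(*feature_maps):
    """Element-wise (Hadamard) product of equally shaped 2-D feature maps.

    Each map is a list of rows. This sketches only the linear fusion step;
    the convolutional refinement of each input is assumed to be done already.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    for fmap in feature_maps:
        if len(fmap) != rows or any(len(row) != cols for row in fmap):
            raise ValueError("all feature maps must share the same shape")
    fused = [[1.0] * cols for _ in range(rows)]
    for fmap in feature_maps:
        for r in range(rows):
            for c in range(cols):
                fused[r][c] *= fmap[r][c]
    return fused


# Toy refined features (hypothetical values, one channel, 2 x 2)
f_st = [[1.0, 2.0], [0.5, 0.0]]    # from the Swin-Transformer branch
f_u = [[2.0, 1.0], [4.0, 3.0]]     # from the U-Net branch
f_prev = [[1.0, 0.5], [1.0, 2.0]]  # from the previous FF module

b_first = hadamard_fuse(f_st, f_u)          # first FF module: two inputs only
b_later = hadamard_fuse(f_st, f_u, f_prev)  # later FF modules: three inputs
```

Because the product is taken position by position, a location is emphasized only when every contributing branch responds there. That is one plausible reading of why this operation suits fusing complementary global and local features.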
Decoder construction. We pass the multi-scale feature information extracted by the FF modules into the Decoder via skip connections. The Decoder is structured to recover the details of the image and output the segmentation results. The overall structure is shown in Fig. 5. To suppress irrelevant regions and enable more fine-grained feature interaction and fusion, we use the attention-gated module Att 36 to combine $ff_i$, $i = 1, 2, 3$, with $up_i$, $i = 2, 3, 4$, recovered by up-sampling, where $up_4$ is obtained from $ff_4$ by linear interpolation with a scale_factor of 2. In the Att module, we combine the contextual information provided by $ff_i$ and the spatially detailed information recovered by $up_{i+1}$, and map them to the interval [-1, 1] using an activation function to obtain the corresponding weights. The weights are then multiplied with $up_{i+1}$ to adaptively modify the features and incorporate both shallow and deep-level features. The formulation is shown in Eqs. (14) to (15). $W_f$ and $W_{up}$ are linear transformations of $ff_i$ and $up_{i+1}$ using a convolution with kernel size 1 × 1; the result is activated by the ReLU function to obtain the fused feature $T_i$. $\sigma$ is a normalisation function, consisting of a convolution with kernel size 1 × 1 and a Batch Normalization (BN) layer. After the combination in the Att module, the feature map $up_1$ is restored to the original resolution by a convolution operation and a linear interpolation operation to output the final segmentation map $mask$. The whole formulation is shown in Eqs. (16) to (17).
where conv consists of 3 groups of convolution units, each consisting of a convolution with kernel size 3 × 3, a Batch Normalization (BN) layer, and the ReLU activation function. Up is linear interpolation with a scale_factor of 4.
The final loss value is obtained by multiplying each of the three loss terms by its corresponding weight and adding the results together. The formula for calculating the total loss is shown in Eq. (19).
α, β and γ are the corresponding weights; they are adjustable hyper-parameters whose specific values are set according to the experimental results.
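The weighted combination of the three losses can be sketched as follows. The form L = α · L_ff + β · L_st + γ · L_u, the constraint α + β + γ = 1, and the default weights (α = 0.5, β = 0.2, γ = 0.3) are taken from the Discussion section. The function name `joint_loss` and the example loss values are hypothetical, and how each individual loss term is computed is not specified here.

```python
def joint_loss(l_ff, l_st, l_u, alpha=0.5, beta=0.2, gamma=0.3):
    """Weighted sum of the fused, Swin-Transformer, and U-Net losses.

    alpha weights the fused loss, beta the Swin-Transformer loss, and gamma
    the U-Net loss. The weights must each lie in [0, 1] and sum to 1.
    """
    weights = (alpha, beta, gamma)
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * l_ff + beta * l_st + gamma * l_u


# Hypothetical per-branch loss values for one training step
total = joint_loss(l_ff=0.2, l_st=0.4, l_u=0.6)
```

In a training loop the three inputs would be the loss tensors of the Decoder output and of the two DFA outputs, and the returned value would be the quantity that is back-propagated.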

Experiments
Dataset and evaluation metrics. The gastric cancer images dataset used in this work was from the digestive endoscopy center of General Hospital of the People's Liberation Army. The study was conducted according to the principles of the Declaration of Helsinki and in accordance with current scientific guidelines. Approval was given by the Ethics Committee of the Chinese People's Liberation Army General Hospital, and written informed consent was obtained from all subjects and their families.
The acquired gastric cancer images were manually labeled using Labelme software according to the lesion regions marked by the expert. Some poor-quality images were removed to ensure the experiment's effectiveness, and 630 pairs of original gastric cancer images and corresponding lesion label images were finally selected, as shown in Fig. 6. The images in this dataset were taken from various angles, at various brightness levels, and at different distances. From the 630 pairs of gastric cancer images, 100 pairs were randomly selected for testing, and the remaining 530 pairs were used for network training. We resized the images to 224 × 224 so that all dataset images have the same size and meet the network training needs. To prevent overfitting due to the small amount of data, we augmented the training set by flipping the images horizontally and vertically, rotating them by arbitrary angles, randomly transforming hue, saturation, and brightness, and panning and zooming. The final training data comprised 9360 images. From the augmented training dataset, 5% of the images (468 images) were randomly selected to form the validation dataset.
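The split sizes quoted above can be checked with a short sketch. The counts (630 pairs, 100 test pairs, 9360 augmented images, 5% validation) come from the text. The function name `split_pairs`, the random seed, and the use of integer indices in place of real image files are assumptions made for illustration, not the authors' procedure.

```python
import random


def split_pairs(n_pairs=630, n_test=100, seed=0):
    """Randomly split image-pair indices into training and test subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_pairs))
    test = set(rng.sample(indices, n_test))
    train = [i for i in indices if i not in test]
    return train, sorted(test)


train_idx, test_idx = split_pairs()

# Validation set: 5% of the augmented training images (integer arithmetic)
n_augmented = 9360
n_val = n_augmented * 5 // 100
```

Fixing the seed keeps the test set identical across runs, which matters when several models are compared on the same 100 test images.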
In addition, we conducted experiments on the Kvasir-SEG 37 and CVC-ClinicDB 38 datasets to evaluate the effectiveness and generalization performance of the proposed method. The Kvasir-SEG dataset is the first dataset for gastrointestinal disease identification and contains 1000 images of polyp lesions and their corresponding masks. The CVC-ClinicDB dataset includes 612 high-resolution images from 31 colonoscopies. The original images were in "tif" format, which we converted to "png" format. We cropped the images uniformly to 224 × 224 to fit the network training requirements and divided them into training, validation, and test sets at a ratio of 8:1:1.
We used Python and the PyTorch framework to build the experimental environment and a GTX3080 GPU to train the network. We trained for 300 epochs with a batch size of 16 and used the Adam optimizer to update the network weights, with a learning rate of 1e-3 and a weight decay of 1e-4. To speed up training, we used the swin_tiny_patch4_window7_224 model pre-trained on ImageNet-1K. We evaluated the segmentation performance of the proposed method using seven metrics, namely IOU, Dice, Accuracy (ACC), Recall (RE), Precision (PR), Specificity (SP) and F1-Score. The evaluation metrics are defined in Eqs. (21) to (27), where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative samples, respectively.
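The seven metrics can be computed from the four confusion-matrix counts as sketched below. The formulas are the standard textbook definitions and are assumed, not confirmed, to match Eqs. (21) to (27). The function names, the zero-denominator convention, and the example counts are hypothetical.

```python
def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def segmentation_metrics(tp, tn, fp, fn):
    """Compute pixel-level segmentation metrics from confusion-matrix counts."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return {
        "IOU": _safe_div(tp, tp + fp + fn),
        "Dice": _safe_div(2 * tp, 2 * tp + fp + fn),
        "ACC": _safe_div(tp + tn, tp + tn + fp + fn),
        "RE": recall,
        "PR": precision,
        "SP": _safe_div(tn, tn + fp),
        "F1": _safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical pixel counts for a single predicted mask
scores = segmentation_metrics(tp=80, tn=100, fp=10, fn=10)
```

With these definitions Dice and F1 coincide for a single binary mask. Reported averages can still differ between the two, depending on whether they are computed per image or over pooled pixels.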
Ablation experiments results. We use ablation experiments to investigate the effectiveness of the DFA module and of fusing the Transformer and the U-Net. Experiments "U-Net" and "ST" used the original U-Net and Swin-Transformer to segment the gastric cancer lesion region. Experiments "U-Net + DFA" and "ST + DFA" replaced the decoder part of the U-Net and Swin-Transformer with the DFA module proposed in this study to evaluate its effectiveness. Experiment "Fusion + FF" uses the original U-Net and Swin-Transformer structures and fuses the feature information output from both using the FF module to verify the effectiveness of the fusion approach. Experiment "Ours" evaluates the model proposed in this paper. Table 1 shows the average and standard deviation of the evaluation metrics for the 100 test images, and Table 2 uses "Params" to report the number of parameters of each model.
As seen in Table 1, the results are least satisfactory when using only the U-Net or the Swin-Transformer for image segmentation, with IOU reaching only 64.1% and 68.5%, respectively. Replacing the decoders in the U-Net and Swin-Transformer with DFA modules ("U-Net + DFA" and "ST + DFA" in Table 1) significantly improved the segmentation results: the U-Net IOU reached 72.9%, an improvement of 8.8 percentage points, and the Swin-Transformer IOU reached 73.9%, an improvement of 5.4 percentage points. As seen in Table 2, the DFA module effectively reduces the number of parameters and the complexity of the model compared to the original decoder. When we instead used the FF module to fuse the two original branches, the IOU reached 74.5%, 6.0 percentage points above the better of the two single-branch results, showing that fusing the two branches with the FF module yields better segmentation results.
Using the FF module and DFA module, the IOU coefficient of the fused network reached 81.3%, an improvement of 6.8% compared to the best results above. The best performance in all other evaluation metrics demonstrates the effectiveness of the method proposed in this paper. It is further demonstrated that fusing Swin-Transformer and U-Net can produce better segmentation results. The segmentation results obtained for several network models are shown in Figs. 7 and 8.
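The improvements quoted in the ablation discussion are absolute differences in IOU (percentage points), and they can be reproduced from the values stated in the text. In the sketch below, the dictionary holds only the IOU figures quoted above for Table 1, and the helper name `gain` is hypothetical.

```python
# IOU values (%) quoted in the text for the ablation experiments in Table 1
iou = {
    "U-Net": 64.1,
    "ST": 68.5,
    "U-Net + DFA": 72.9,
    "ST + DFA": 73.9,
    "Fusion + FF": 74.5,
    "Ours": 81.3,
}


def gain(new, old):
    """Absolute IOU difference in percentage points, rounded to one decimal."""
    return round(iou[new] - iou[old], 1)


gains = {
    "DFA on U-Net": gain("U-Net + DFA", "U-Net"),
    "DFA on ST": gain("ST + DFA", "ST"),
    "FF over best single branch": gain("Fusion + FF", "ST"),
    "Ours over Fusion + FF": gain("Ours", "Fusion + FF"),
    "ST + DFA over U-Net + DFA": gain("ST + DFA", "U-Net + DFA"),
}
```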
As can be seen from Figs. 7 and 8, the segmentation result of (f) is closer to the ground truth labels, once again proving the effectiveness of our proposed method. (e) shows the segmentation results generated by fusing Swin-Transformer and U-Net using the FF module. It can be seen that lesion localization is more accurate than using only Swin-Transformer, and it also focuses on global information and gives better results in the presence of multiple lesions than using only the U-Net. (c) and (d) are the segmentation results obtained by using the DFA. It can be seen that the edges are more evident than in (a) and (b) because the RFB module increases the receptive field while effectively suppressing interference information. Besides, (b) and (d) are segmentation results generated using Swin-Transformer as the backbone. It can be found that the Swin-Transformer architecture pays attention to discontinuous lesion regions compared to the generated results obtained from (a) and (c) using U-Net as the backbone. The result proves that the Transformer is better focused on extracting global contextual information and performs better in modeling explicit long-range relations. The direct comparison between the ground truth labels and the segmentation results in Fig. 8 provides a more intuitive indication of the quality of the segmentation results. It shows that the segmentation results obtained by our proposed model are closer to the actual labels.
Comparative experiments results. In this paper, we also compare our proposed model with several previous image segmentation methods, and the average results are shown in Table 3. For a fair comparison, all experiments use the same data pre-processing, pretraining parameters, and evaluation metrics. Compared with R2U-Net, AttU-Net, PraNet, and DeepLabV3, our IOU indexes improved by 16.8%, 10.4%, 14%, and 4.1%, and the other performance indexes were all optimal values. Compared with TransUNet and TransFuse, which also use the combination of CNNs and Transformers, the IOU indexes improved by 6.7% and 6.8%, which proves that our proposed method is more effective for gastric cancer lesion segmentation. The histogram in Fig. 9 provides a more precise visual comparison of the results of our model with those of other leading models. Figure 10 shows the segmentation results obtained by each model on our dataset. The combination of Figs. 9 and 10 again demonstrates that our model performs well in lesion segmentation of gastric cancer images, yielding high-quality segmentation results with the best segmentation performance.
Validation experiments on public datasets. In our work, we also conducted experiments on the Kvasir-SEG and CVC-ClinicDB datasets to evaluate the generalization performance of the models. All experiments use the same experimental environment, data pre-processing methods, and pre-training parameters. We used IOU, Dice, ACC, RE, and PR to evaluate the experimental results, and the average results are shown in Table 4.
As can be seen from Table 4, on the Kvasir-SEG dataset, PraNet achieves the best IOU and Dice coefficients, but our model differs from it by only 1.2% and 0.04%; DeepLabV3 achieves the best recall, and we differ from [...]
Table 1. Comparison of ablation experiment results. *"ST" indicates the Swin-Transformer model, "Fusion" indicates the fusion of two branches, "DFA" is the deep feature aggregation decoder, and "FF" is the feature fusion module. Bold characters indicate the best performance.
Figure 11 shows the segmentation results obtained by each model on the Kvasir-SEG and CVC-ClinicDB datasets, overlaid on the ground truth labels. Red represents the ground truth label, yellow represents the predicted result, and the intersection of both is green. The results show that our proposed model is close to the actual segmentation results and produces high-quality results.

Discussion
The total loss function is $L = \alpha \cdot L_{ff} + \beta \cdot L_{st} + \gamma \cdot L_{u}$, and the weights α, β and γ of its three parts need to be determined experimentally. α, β and γ each lie in [0, 1], and α + β + γ = 1. In Table 1, we experimentally confirmed that the segmentation results obtained by fusing Swin-Transformer and U-Net are satisfactory, and that the results obtained using only Swin-Transformer are better than those obtained using only U-Net. Therefore, we initially set α = 0.5, β = 0.3, γ = 0.2. However, Table 5 shows that α = 0.5, β = 0.2, γ = 0.3 gives the best results. The experiments show that increasing the U-Net loss weight gives better results than increasing the Swin-Transformer loss weight, which is contrary to our initial hypothesis. Table 1 offers an explanation: "U-Net + DFA" is 8.8% better than the U-Net segmentation, "ST + DFA" is 5.4% better than the Swin-Transformer segmentation, and "ST + DFA" is only 1.0% better than "U-Net + DFA". This suggests that the DFA module strongly affects the segmentation results and benefits the U-Net branch more than the Swin-Transformer branch in gastric cancer image segmentation. Therefore, in our experiments, we set α = 0.5, β = 0.2, γ = 0.3.
For a more concrete visualization of the regions the model attends to, a heat map was created using Grad-CAM. Grad-CAM 43 uses the back-propagated gradients of the network to calculate the weight of each channel of the feature map and obtain the heat map. We visualize the regions of interest for feature layers down_1 to down_4, which use the FF module for feature fusion during down-sampling, and feature layers up_1 to up_3, which recover resolution during up-sampling. The blue and red colors in the Grad-CAM maps indicate lower and higher activation values, respectively. The visualization results are shown in Fig. 12. The down-sampling process gradually shifts the network's focus from low-level to high-level semantic features and can pinpoint the location of the lesion. During up-sampling to recover resolution, the model further incorporates low-level semantic features passed through the skip connections to make accurate predictions about the location [...]
Figure 7. Segmentation result of gastric cancer images. Image is the original gastric image; Label is the ground truth label; (a) to (e) correspond to the lesion segmentation results obtained from "U-Net", "ST", "U-Net + DFA", "ST + DFA" and "Fusion + FF" in Table 1, respectively; (f) is the segmentation result obtained from the model proposed in this paper.
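The Grad-CAM weighting used for the heat maps in this section can be sketched in pure Python. The sketch follows the usual Grad-CAM recipe: channel weights are the spatial means of the gradients, and the map is the ReLU of the weighted sum of activations, scaled to [0, 1]. The function name `grad_cam`, the nested-list representation, and the toy values are assumptions for illustration. In practice the activations and gradients would come from framework hooks on the chosen layer.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from per-channel activations and gradients.

    Both arguments are lists of K channels, each an H x W nested list.
    Returns an H x W map scaled to [0, 1] (all zeros if nothing is positive).
    """
    # Channel weight = spatial mean of that channel's gradient
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels, followed by ReLU
    cam = [
        [
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(width)
        ]
        for r in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two channels on a 2 x 2 feature map (hypothetical values)
acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat = grad_cam(acts, grads)
```

The ReLU keeps only locations that push the target score up. This is why the low-activation (blue) regions in such maps correspond to areas the layer does not associate with the lesion.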

Conclusions
In this paper, we proposed a Dual-Branch Hybrid Network that effectively fuses the Swin-Transformer and the U-Net for lesion segmentation of gastric cancer images. We built the Deep Feature Aggregation Decoder (DFA) to replace the original decoder structure of the network, effectively reducing the complexity of the model and pinpointing the lesion regions. In addition, we used the FF module to fuse the advantageous features extracted by the U-Net and the Transformer, compensating for the lack of global contextual information obtained by the former and the inadequate capture of spatially detailed information by the latter. Our experiments also demonstrated that the FF and DFA modules positively affect the segmentation results. We computed a three-part loss to iteratively train the network, making the segmentation results closer to the ground truth labels. Furthermore, the region of interest of the entire network model was visualized using Grad-CAM, indirectly showing that our segmentation network is realistic and practical. Performance indicators showed that our model achieves 81.3% IOU, 89.5% Dice, and 94.0% accuracy in the segmentation of the lesion region, achieving the best results in several evaluation metrics and outperforming existing segmentation models. The output of the model was closer to the manual segmentation standard for lesions in gastric cancer images. Our experimental results also show that the IOU can be improved further. In the image segmentation task, the fuzzy labeling of the boundary between the lesion and the background leads to poor learning at boundary locations, which explains the relatively low IOU. In future work, we will improve the IOU by enhancing the ability to extract features from boundary regions. We also need to improve the generalization performance of the model so that it can be applied to other medical segmentation domains.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.